Diabetes is one of the most common diseases worldwide, and it must be
detected early so that proper treatment can prevent the condition
from becoming more severe. Automated detection plays an essential role in
the early diagnosis of diabetes. Over the last few decades, many sophisticated
machine learning algorithms and data analysis approaches have been applied
to diabetes prediction. To determine the best model for early-stage diabetes
prediction, ten different machine learning classifiers were used in this
study. These models were evaluated in terms of accuracy, precision,
specificity, recall, F1-score, negative predictive value (NPV), false positive
rate (FPR), rate of misclassification, and receiver operating characteristics
(ROC) curve. The experimental findings indicated that all of the models
performed well. Gradient boosting (GB), with 97.2% accuracy, showed
the best performance on the early-stage diabetes risk prediction
dataset. Random forest (RF) and AdaBoost performed similarly to GB;
however, their precision was not as good as GB's.
1. INTRODUCTION

Diabetes is a metabolic disease that affects millions of individuals throughout the world. Every year,
the rate of occurrence rises drastically. In 2014, there were around 387 million diabetics worldwide. These
figures are projected to more than double by 2030, according to the World Health Organization (WHO) [1].
Diabetes-related problems in several vital organs of the body can be lethal if left untreated. Diabetes must be
detected early to receive proper treatment, which can prevent the condition from escalating to severe
problems [2]. As a result, more accurate automated detection plays an essential role in the early
diagnosis of diabetes [3]. Over the last few decades, many sophisticated machine learning algorithms and data
analysis approaches have been developed in the medical industry, among other fields [4]. For applications such
as disease diagnosis, brain tumor detection, breast cancer detection, and therapy, machine learning has
become an indispensable tool in the medical profession [5].

Diabetes mellitus (DM), widely known as diabetes, is a collection of metabolic diseases caused
primarily by aberrant insulin production [6]. Blood sugar levels rise when the pancreas fails to produce
enough insulin or when cells fail to respond to it, damaging multiple organs, including the eyes, kidneys, and
nerves. For this reason, diabetes is also known as the "silent killer". Diabetes is classified into type I
diabetes, type II diabetes, and gestational diabetes [7]. In type I diabetes, the pancreas secretes very little
or no insulin because the insulin-producing pancreatic cells are attacked and stop operating. Type I diabetes
accounts for 5% to 10% of diabetes cases and can appear in any age group, including childhood and
adolescence [8]. Type II diabetes accounts for over 90% of all diabetes cases globally. It develops when the
body's insulin production is insufficient [9]. Both adults and children can develop type II diabetes.
Gestational diabetes mellitus (GDM) is a third kind of diabetes, similar to type II diabetes in that it is
caused by an imbalance between insulin secretion and responsiveness. This medical issue develops over time as
a result of high blood pressure (hypertension). Gestational diabetes affects about 2% to 10% of all pregnant
women, and it can progress or disappear after delivery [10]. Diabetes can signal the onset of other diseases.
Researchers around the globe are working to tackle the illness by creating effective prediction and detection
tools and viable therapies [11]. In this context, machine learning approaches play an essential part in
predicting the disease. Classification, a supervised machine learning technique, is the preferred
methodology for categorizing labeled data [12].

This section reviews some of the better-known earlier work on diabetes prediction using various
supervised and unsupervised machine learning algorithms, as well as comparisons between different
classification methods. A comparison of random forest (RF), K-means clustering, and artificial neural
network (ANN) approaches for diabetes prediction was carried out in [13]; the ANN method had the highest
accuracy, 75.7%. On the Pima Indian diabetes dataset (PIDD), [14], [15] found that the Naive Bayes (NB)
classifier outperforms the support vector machine (SVM) and decision tree (DT) machine learning algorithms,
with an accuracy of 76.30%. Sadeghi et al. [16] used a deep neural network (DNN), extreme gradient boosting
(XGBoost), and RF to predict diabetes in the Tehran Lipid and Glucose Study (TLGS) cohort data, where DNN
achieved the highest accuracy. On PIDD, [17] applied machine learning techniques, among which generalized
boosted regression modeling showed the highest accuracy, 90.91%. Zecchin et al. [18] and Reddy et al. [19]
employed a neural network (NN) and a polynomial model to predict short-term blood glucose. This approach
requires continual monitoring and sampling, which is time-consuming.

A summarized analysis of further relevant literature is presented in Table 1. The analysis shows that no
single algorithm is best for all problems; several factors, such as the organization and size of the
dataset, are critical. The methods used include DT, RF, GB, principal component analysis (PCA), k-nearest
neighbor (KNN), expectation maximization (EM), NN, logistic regression (LR), radial basis function (RBF),
and multifactor dimensionality reduction (MDR).


Table 1. List of relevant works from various literature

| References | Classifiers applied | Claimed result | Constraints | Highest classification accuracy (%) | Dataset used |
|---|---|---|---|---|---|
| Zou et al. [20] | DT, RF, and NN | RF predicts diabetes with 80% sensitivity, 89% specificity, and 85% accuracy. | Only the glucose index did well. For effective results, more indexes are required. | 80.84 (RF) | Luzhou and Pima Indian diabetes dataset |
| Heydari et al. [21] | DT, SVM, ANN, Bayesian networks, and SNN | ANN outperforms with an accuracy of 97.44%. | Doctors classify data mining. Experts should review disease data mining. | 97.44 (ANN) | Dataset from Tabriz University of Medical Sciences, Iran |
| Wu et al. [22] | K-means clustering and LR | Improved K-means algorithm performs well. | Data preparation is time-consuming. | 95.42 (K-means) | Pima Indian diabetes dataset |
| Nilashi et al. [23] | PCA-KNN, PCA-SVM, EM, PCA-fuzzy rule-based | EM-PCA-fuzzy rule-based performs well. | Compared to enormous healthcare data, this dataset's data is simple. | 92.9 (EM-PCA) | Pima Indian diabetes dataset |
| Husain and Khan [24] | LR, KNN, RF, GB, ensemble method | Ensemble method effectively predicts diabetes with 75% accuracy. | There are concerns with under-fitting, over-fitting, and high training time overhead. | 75 (ensemble method) | National Health and Nutrition Examination Survey (NHANES) 2013-14 dataset |
| Sarwar and Sharma [25] | ANN, KNN, and NB | ANN predicted 96% accurately, followed by KNN (91%) and NB (95%). | This study's dataset could be expanded to include clinical factors to compare them. | 96 (ANN) | 500 participants were picked at random from various social groups. |
| Kaur and Kumari [26] | ANN, k-NN, SVM-linear, SVM-RBF, and MDR | k-NN and SVM linear identify diabetes best. | A limited number of parameters | 89 (Linear kernel SVM) | Pima Indian diabetes dataset |

Diabetes prediction is difficult. This study uses the early-stage diabetes dataset (ESDD) to determine
the best classifier model among ten for predicting outcomes. For a thorough evaluation, nine measures were
used: confusion matrix, classification accuracy, precision, recall/sensitivity/true positive rate (TPR), false
positive rate (FPR), negative predictive value (NPV), rate of misclassification, F1-score, and receiver
operating characteristics (ROC) curve. Contributions: i) to determine the best model for early-stage diabetes
prediction, ten different machine learning classifiers, including KNN, ANN, DT, stochastic gradient descent
(SGD), RF, SVM, GB, NB, AdaBoost, and LR, have been used; and ii) these models were evaluated in terms
of accuracy, precision, specificity, recall, F1-score, NPV, FPR, rate of misclassification, and ROC curve.

The paper is organized as follows: section 1 presents the introduction and a survey of related
literature. Section 2 describes the classification methods. Section 3 details the performance evaluation
measures for classifiers. Section 4 describes the experimental approach, its setup, and the dataset.
Section 5 discusses the results obtained. Finally, section 6 concludes the study and outlines implications
for future work.


2. DESCRIPTION OF CLASSIFICATION METHODS 

This section gives brief background on each of the ten classifiers used in this study, to provide
context for the results that follow.


2.1. Decision tree classifier 

A DT is a supervised learning model that can address both classification and regression problems,
although it is mostly used for classification. It has a tree-like layout in which each internal node
represents a test on a feature value, each branch represents an outcome of that test, and each leaf node
represents a class label. It can quickly produce intelligible rules and classify data with minimal processing.
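
The node, branch, and leaf layout described above can be sketched as a nested dictionary. This is only an illustration: the tree below is written by hand, not learned from the dataset, and the choice of split features (two attribute names borrowed from Table 2) is an assumption.

```python
# Hypothetical hand-written tree: internal nodes are dicts, leaves are class labels.
tree = {
    "feature": "polyuria",
    "branches": {
        "Yes": "Positive",
        "No": {
            "feature": "polydipsia",
            "branches": {"Yes": "Positive", "No": "Negative"},
        },
    },
}

def predict(node, sample):
    """Follow the branch matching each tested feature value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

label = predict(tree, {"polyuria": "No", "polydipsia": "Yes"})
```

Each prediction costs at most one lookup per tree level, which is why a trained tree classifies data with very little processing.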


2.2. k-nearest neighbor classifier 

KNN is a lazy learning technique: it builds no explicit model during training and generates a prediction
from the k training instances nearest to the input. The entire prediction process is carried out only when a
prediction for an instance is requested. The Euclidean distance is frequently used to determine proximity [27].
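
A minimal sketch of this procedure, assuming numeric feature vectors and a majority vote among the k nearest training instances. The toy data and the function names are hypothetical and not taken from the study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training instances."""
    # Lazy learning: nothing is fitted in advance; all work happens here.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (scaled age, symptom present as 0/1) with class labels.
train_X = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.6, 1)]
train_y = [0, 0, 1, 1, 1]
prediction = knn_predict(train_X, train_y, query=(0.65, 1), k=3)
```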


2.3. Support vector machine classifier 

SVM is based on finding a hyperplane that separates the two classes of the dataset.
The approach works well with both linearly and nonlinearly separable datasets, and it performs
particularly well when the dataset has a large number of attributes.


2.4. Naive Bayes classifier 

This classifier applies Bayes' theorem and is well suited to high-dimensional inputs. An NB model
combines Bayes' theorem with a strong independence assumption: a particular feature of a class is assumed to
be independent of every other feature of that class. The technique yields high precision when this
underlying assumption holds [28].
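
The independence assumption means that the class score is simply the prior multiplied by one per-feature likelihood term. A minimal sketch for categorical features follows; the Laplace smoothing, the toy rows, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and, per (class, feature index), value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def predict_nb(model, row, n_values=2):
    """Return the class with the highest log posterior under feature independence."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(row):
            # Laplace smoothing (an assumption here) avoids log(0) for unseen values.
            score += math.log((value_counts[(label, j)][value] + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data with two Yes/No features.
rows = [("Yes", "Yes"), ("Yes", "No"), ("No", "Yes"), ("No", "No"), ("No", "No"), ("Yes", "Yes")]
labels = [1, 1, 1, 0, 0, 1]
model = train_nb(rows, labels)
```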


2.5. Logistic regression classifier 

Rather than hard class labels, the LR model produces probability estimates. This method is
appropriate for binary classification. The log-odds of an event occurring is modeled as a linear
combination of the input features.
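
As a small illustration, LR passes a linear combination of the input features through the logistic (sigmoid) function to obtain a probability, which is then thresholded to give a class. The weights, bias, and 0.5 threshold below are hypothetical values, not fitted parameters from the study.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear part (log-odds)
    return sigmoid(z)

# Hypothetical parameters for two binary features.
weights, bias = [1.5, 2.0], -1.0
p = predict_proba(weights, bias, [1, 0])  # log-odds = 0.5
label = 1 if p >= 0.5 else 0
```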


2.6. Random forest classifier 

RF is a common machine learning approach that produces predictions from the results of several DTs
constructed on different subsets of the dataset. It can be used for both regression and classification. In
regression, the mean of all the DT outputs is computed; in classification, the votes of the individual DTs
are pooled to obtain the final result.
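
The two ingredients described here, training each tree on a different sample of the data and then pooling the tree outputs, can be sketched as below. Bootstrap sampling with replacement is assumed as the way the different subsets are drawn; the function names are hypothetical and the tree training itself is omitted.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, seed):
    """Draw len(rows) rows with replacement; each tree would get its own sample."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))

def forest_classify(tree_votes):
    """Classification: the majority vote over the individual tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Regression: the mean of the individual tree outputs."""
    return mean(tree_outputs)

rows = list(range(10))
sample = bootstrap_sample(rows, seed=0)
vote = forest_classify(["Positive", "Negative", "Positive"])
average = forest_regress([1.0, 2.0, 3.0])
```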


2.7. Artificial neural network classifier 

ANN is a supervised classification approach in which the neural architecture of the human brain is
emulated in software in a simplified form. An ANN is a group of artificial neurons that receive data, change
their internal state (known as activation), and generate output based on the input provided and the
activation function used [29].
2.8. Adaboost classifier

AdaBoost has the benefits of being easy to implement, having few parameters to tune, and
generalizing well. However, it may give sub-optimal results and is vulnerable to outliers and
noisy data [30].


2.9. Gradient boosting classifier

Boosting algorithms iteratively combine weak learners, that is, learners only slightly better than
random guessing, into a strong learner. GB creates new models that predict the residuals of the previous
models; these models are then combined to generate the decision boundary. GB aims to estimate the function
that maps samples to their output values by minimizing the expected value of an error function over the
training sample.
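
The idea of fitting each new model to the residuals of the models built so far can be sketched as follows. This is a simplified illustration under stated assumptions: a single numeric feature, squared-error loss, two-leaf regression trees (stumps) as weak learners, and a hypothetical learning rate. It is not the configuration used in the experiments.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Pick the threshold whose two-leaf fit has the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both leaves are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical toy data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
```

With squared-error loss the residuals are exactly the negative gradient of the loss, which is why fitting residuals amounts to a gradient step in function space.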


2.10. Stochastic gradient descent classifier 

SGD is a straightforward method for locating a local minimum of a function whose values are
affected by noise. It is a popular optimization method in machine learning. The approach minimizes a
model's error by iteratively computing the gradient of an error function on a single training instance or
a small batch of instances and updating the model parameters accordingly [31].
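
A minimal sketch of the update loop, assuming a one-feature linear model with squared error and one training instance per update. The learning rate, epoch count, and toy data are hypothetical.

```python
import random

def sgd_linear(data, lr=0.05, epochs=300, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the instances in a new random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b, for this instance only.
            w -= lr * error * x
            b -= lr * error
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w, b = sgd_linear(data)
```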


3. CLASSIFIER PERFORMANCE EVALUATION METRICS 

To evaluate a classification model, specific performance metrics must be defined with which its
quality can be assessed. In this study, nine different metrics were used to evaluate the performance of the
classifiers. A brief description of each measure follows.


3.1. Confusion matrix 

The confusion matrix (CM) is a table that summarizes the performance of a classification model.
Correct predictions appear as the diagonal entries of the matrix. True positive (TP): instances whose real
class is positive and that are predicted as positive. False positive (FP): instances that the learning
system wrongly recognizes as positive when they are actually negative. True negative (TN): instances whose
real class is negative and that are predicted as negative. False negative (FN): instances that the learning
system wrongly classifies as negative when they are actually positive. The classifier's performance can be
assessed using the confusion matrix.
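
The four counts can be tallied directly from the true and predicted labels. The sketch below is an illustration with hypothetical labels; the function name is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN, and FN for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```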


3.2. Classification accuracy 
This metric gives the overall success rate of a classifier, expressed as the percentage of all
predictions that are correct. Accuracy can be expressed mathematically as the following equation:


Accuracy = (TP + TN)/(TP + TN + FP + FN) 


3.3. Precision 
Precision is one of the important metrics for evaluating the performance of classifiers. It is defined
as the ratio of TP to the sum of TP and FP. The formula for calculating precision is:


Precision = (TP) /(TP + FP) 


3.4. Recall/Sensitivity/TP rate 

The true positive rate is often known simply as recall or sensitivity. It is defined as the ratio of
TP to the sum of TP and FN. Recall can be expressed mathematically as the following equation:


Sensitivity = (TP) /(TP + FN)


3.5. False positive rate (FPR)
The FPR is defined as the ratio of FP to the sum of FP and TN. The formula for calculating the
FPR is:


FPrate = (FP) /(FP + TN)


3.6. Negative predictive value (NPV)

NPV is also an important metric for evaluating the performance of classifiers. It is defined as
the ratio of TN to the sum of TN and FN. It can be calculated mathematically as:


NPV = (TN) /(TN + FN) 


3.7. Rate of misclassification 

It is calculated as the proportion of incorrectly identified samples to the total number of samples.
Erroneous classifications can be divided into two categories. If a patient's absence of diabetes is
interpreted as a diagnosis of diabetes (an FP), this is a Type-I Error (E1). If the presence of diabetes is
misclassified as the absence of diabetes (an FN), this is a Type-II Error (E2).


RateMisclassification (E1 + E2) = (FP + FN)/(TP + TN + FP + FN)


3.8. F1-score 

The F-measure is another name for the F1-score. It is calculated by taking the harmonic mean of
precision and recall. A value of 0 indicates the worst performance, whereas a value of 1 indicates the
best performance.


F1-score = 2 * precision * recall / (precision + recall)
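
The formulas in sections 3.2 to 3.8 all follow from the four confusion-matrix counts, so they can be computed together. The sketch below is illustrative only: the function name and the example counts are hypothetical, and no guard is included for zero denominators.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the count-based metrics of sections 3.2-3.8 as fractions in [0, 1]."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "npv": tn / (tn + fn),
        "misclassification_rate": (fp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

Note that accuracy and the rate of misclassification always sum to 1, since every sample is either correctly or incorrectly classified.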


3.9. ROC curve 

In a ROC curve, TPR values are plotted against FPR values, with the x-axis representing the FPR and
the y-axis representing the TPR. The curve shows a model's ability to discriminate between classes
across decision thresholds. The larger the area under the curve (AUC), the better the classifier is
at distinguishing between individuals with and without the condition.
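
As an illustration, the ROC points can be obtained by sweeping the decision threshold over the classifier scores, and the AUC by applying the trapezoidal rule to those points. The scores and labels below are hypothetical, and the function names are assumptions.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, lowering the threshold one distinct score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(y_true, scores))
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.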


4. EXPERIMENTAL SETUP 

Experiments were conducted to test the effectiveness and efficiency of the machine learning
classifiers. The Orange machine learning and data mining toolbox was used to test the classifiers.
This toolbox comprises ML algorithms for classification, regression, data preprocessing, association rules,
and clustering. It has a graphical data analysis interface in which widgets serve as data processing
points on a canvas, connected by workflow lines. Figure 1, shown in the appendix, illustrates the widgets
and workflow lines of the experimental setup in the Orange data mining toolkit.


4.1. Dataset description 

This empirical study uses the Early Stage Diabetes Risk Prediction dataset from the UCI machine
learning repository. There are 520 instances in this dataset, and each instance has 17 attributes. There are
320 positive samples and 200 negative samples in total. Table 2 summarizes the dataset's features and their
Chi2 values. For the purposes of the study, all instances in the dataset without diabetes were assigned to
the NEGATIVE (0) class, whereas instances with diabetes were assigned to the POSITIVE (1) class.


Table 2. Dataset features information and Chi2 values for different features

| S.No. | Attribute name | Abbreviation | Values | Chi2 | p |
|---|---|---|---|---|---|
| 1 | Age | age | 20-65 | 16.05 | 0.001 |
| 2 | Sex | sex | Male/Female | 104.94 | 0 |
| 3 | Weakness | wk | Yes/No | 30.77 | 0 |
| 4 | Polydipsia | pd | Yes/No | 218.84 | 0 |
| 5 | Partial Paresis | pp | Yes/No | 97.17 | 0 |
| 6 | Polyuria | pu | Yes/No | 230.6 | 0 |
| 7 | Polyphagia | pph | Yes/No | 61 | 0 |
| 8 | Muscle Stiffness | ms | Yes/No | 78 | 0.005 |
| 9 | Delayed Healing | dh | Yes/No | 1.15 | 0.284 |
| 10 | Alopecia | alp | Yes/No | 37.21 | 0 |
| 11 | Obesity | obs | Yes/No | 2.71 | 0.1 |
| 12 | Visual Blurring | vb | Yes/No | 32.84 | 0 |
| 13 | Sudden Weight Loss | swl | Yes/No | 99.11 | 0 |
| 14 | Genital Thrush | gt | Yes/No | 6.32 | 0.012 |
| 15 | Irritability | irr | Yes/No | 46.63 | 0 |
| 16 | Itching | ich | Yes/No | 0.09 | 0.76 |
| 17 | Class | class | Positive/Negative | - | - |

We selected the Chi-squared test for feature selection because the dataset contains mostly categorical
attributes, except 'Age', and the target feature is categorical. We found that Polyuria and Polydipsia had the
highest Chi-squared values of all attributes. Table 2 also shows low Chi-squared values for itching and
delayed healing. Based on the Chi-squared test, we rejected 'itching' and 'delayed healing'.
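
For a Yes/No attribute against a Positive/Negative class, the Chi-squared statistic can be computed from a 2x2 contingency table of observed counts, and with one degree of freedom the p-value has a closed form. The sketch below is a minimal illustration: the contingency counts are hypothetical, no continuity correction is applied, and it is an assumption that this matches the computation performed by the toolkit.

```python
import math

def chi2_statistic(table):
    """Pearson Chi-squared statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail probability of a Chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical 2x2 table: rows = symptom Yes/No, columns = class Positive/Negative.
table = [[30, 10], [10, 30]]
stat = chi2_statistic(table)
p = p_value_1df(stat)
```

Applied to the statistics listed in Table 2, `p_value_1df(1.15)` gives roughly 0.284 and `p_value_1df(0.09)` roughly 0.76, in line with the delayed-healing and itching rows.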


5. RESULTS AND DISCUSSION 

The experiment used the Early-Stage Diabetes Risk Prediction dataset to detect diabetes
using random sampling, stratified shuffle split, and tenfold cross-validation with an 80% training set size.
A confusion matrix measures classifier performance; Table 3 summarizes the confusion matrices for all ten
classifiers studied. According to the results for the ten classifiers shown in Table 4, RF and AdaBoost
achieved classification accuracies of 96.6% and 96.5%, respectively, whereas ANN, SGD, SVM, LR, NB, and DT
achieved classification accuracies of 95.6%, 90.2%, 96.3%, 90.2%, 88.4%, and 94.7%, respectively. KNN has the
lowest accuracy, 88.2%, while GB performs best with 97.2% accuracy (refer to Table 4).
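
A stratified split keeps the class proportions the same in the training and test parts. The sketch below shows a single 80/20 stratified shuffle split; the seed, the function name, and the use of one split rather than repeated sampling are assumptions, and the class sizes are those implied by the row sums of the confusion matrices in Table 3.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# 320 positive and 200 negative instances (TP + FN and TN + FP in Table 3).
labels = [1] * 320 + [0] * 200
train_idx, test_idx = stratified_split(labels)
```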


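Table 3. Summarised depiction of confusion matrices for all ten classifiers

|    | GB  | Tree | RF  | SVM | KNN | AdaBoost | ANN | LR  | SGD | NB  |
|----|-----|------|-----|-----|-----|----------|-----|-----|-----|-----|
| TN | 199 | 198  | 199 | 195 | 197 | 200      | 197 | 179 | 180 | 180 |
| FP | 1   | 2    | 1   | 5   | 3   | 0        | 3   | 21  | 20  | 20  |
| FN | 1   | 9    | 1   | 6   | 25  | 0        | 3   | 22  | 15  | 45  |
| TP | 319 | 311  | 319 | 314 | 295 | 320      | 317 | 298 | 305 | 275 |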


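Table 4. Performance statistics of ten classifiers

| Model | AUC   | F1    | Precision | Recall | Accuracy (%) | NPV (%) | FPR (%) | RMC (%) |
|-------|-------|-------|-----------|--------|--------------|---------|---------|---------|
| k-NN  | 0.954 | 0.883 | 0.894     | 0.882  | 88.2         | 88.74   | 1.5     | 5.385   |
| Tree  | 0.943 | 0.947 | 0.947     | 0.947  | 94.7         | 95.65   | 1       | 2.115   |
| SVM   | 0.991 | 0.962 | 0.962     | 0.963  | 96.3         | 97.01   | 2.5     | 2.115   |
| SGD   | 0.896 | 0.902 | 0.902     | 0.902  | 90.2         | 92.31   | 10      | 6.731   |
| RF    | 0.996 | 0.966 | 0.967     | 0.966  | 96.6         | 99.5    | 0.5     | 0.385   |
| NN    | 0.991 | 0.956 | 0.956     | 0.956  | 95.6         | 98.5    | 1.5     | 1.154   |
| NB    | 0.953 | 0.885 | 0.891     | 0.884  | 88.4         | 80      | 10      | 12.5    |
| LR    | 0.964 | 0.902 | 0.902     | 0.902  | 90.2         | 89.05   | 10.5    | 8.269   |
| GB    | 0.988 | 0.972 | 0.972     | 0.972  | 97.2         | 99.5    | 0.5     | 0.385   |
| AB    | 0.967 | 0.966 | 0.966     | 0.965  | 96.5         | 100     | 0       | 0       |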
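
The NPV, FPR, and RMC columns are functions of the confusion-matrix counts alone, so they can be recomputed from Table 3 as a consistency check. The sketch below does this; the dictionary layout and function name are assumptions, while the counts are copied from Table 3.

```python
# Counts copied from Table 3 in the order (TN, FP, FN, TP).
table3 = {
    "GB": (199, 1, 1, 319),
    "Tree": (198, 2, 9, 311),
    "RF": (199, 1, 1, 319),
    "SVM": (195, 5, 6, 314),
    "KNN": (197, 3, 25, 295),
    "AdaBoost": (200, 0, 0, 320),
    "ANN": (197, 3, 3, 317),
    "LR": (179, 21, 22, 298),
    "SGD": (180, 20, 15, 305),
    "NB": (180, 20, 45, 275),
}

def derived_rates(tn, fp, fn, tp):
    """NPV, FPR, and rate of misclassification, each as a percentage."""
    total = tn + fp + fn + tp
    return {
        "NPV": round(100 * tn / (tn + fn), 2),
        "FPR": round(100 * fp / (fp + tn), 2),
        "RMC": round(100 * (fp + fn) / total, 3),
    }

rates = {name: derived_rates(*counts) for name, counts in table3.items()}
```

For example, the SVM counts give an FPR of 5 / (5 + 195) = 2.5%, and the NB counts give an RMC of 65 / 520 = 12.5%.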


The F1-score is the harmonic mean of recall and precision; a higher value implies better
classification capability. The classifiers LR and SGD have the same F1-score, 0.902. Similarly, RF and
AdaBoost produce equal F1-scores of 0.966, while ANN, DT, and NB show F1-scores of 0.956, 0.947, and
0.885, respectively. KNN has the lowest F1-score, 0.883, while GB performs best with an F1-score of 0.972.
In terms of the rate of misclassification (RMC), AdaBoost surpassed all other classifiers with a 0% rate,
while NB has the highest (worst) RMC, 12.5%. For the other classifiers, RMC ranged from 0.4% to 8.3%. In
terms of NPV, AdaBoost surpassed all other classifiers with 100%, while NB has the lowest NPV, 80%. For
the other classifiers, NPV ranged from 88.7% to 99.5%.

RF (0.996), SVM (0.991), and ANN (0.991) have AUC values above 0.99, indicating the strongest
classification performance, as shown in Table 4; the remaining classifiers are SGD (0.896), DT (0.943),
NB (0.953), KNN (0.954), LR (0.964), AdaBoost (0.967), and GB (0.988). From the patient's point of view,
the negative class prediction is the most sensitive: if a diabetic patient is erroneously classified as
negative, the condition may become dangerous, as the patient would not be treated. AdaBoost has an FPR of
0%, while all other classifiers have FPRs ranging from 0.5% to 10.5%, as shown in Table 4. On this basis,
the AdaBoost classifier is expected to produce the best results on the early-stage diabetes risk prediction
dataset. The ROC curves in Figures 2(a) and 2(b) illustrate the primary reason for the enhanced results
obtained by using the AdaBoost classifier. AdaBoost makes it possible to merge several "weak classifiers"
into a single "strong classifier." The weak learners in AdaBoost are decision trees with a single split,
often known as decision stumps. The AdaBoost classifier works by giving more weight to cases that are hard
to categorize and less weight to those that are already well classified. As a result, it is well suited to
the dataset used in this study.

Figure 2. ROC analysis of (a) negative class and (b) positive class
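
The re-weighting of hard and easy cases mentioned in the discussion above can be illustrated with one boosting round. The sketch assumes the standard discrete AdaBoost update for binary labels; the sample weights and the pattern of correct predictions are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: return the learner weight alpha and the new sample weights."""
    # Weighted error of the current weak learner (total weight of its mistakes).
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples are scaled up, correctly classified ones are scaled down.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical round: five equally weighted samples, the last one misclassified.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, correct)
```

After the update the misclassified sample carries half of the total weight, so the next decision stump is pushed to classify it correctly.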


6. CONCLUSION 

Diabetes is a disease that affects a large number of people. Early detection of diabetes not only
lowers treatment costs but also saves lives, so a reliable prediction system is quite helpful. In this
study, the Chi-squared test was used for feature selection. This research also implies that the
Chi-squared test can be used for feature selection on small datasets and that it is preferable for selecting
attributes without medical domain knowledge. Thus, with a limited number of parameters, we were able to achieve
high performance from the machine learning models using this feature selection strategy. To determine
the best model for early-stage diabetes prediction, ten different machine learning classifiers, including KNN,
ANN, DT, SGD, RF, SVM, GB, NB, AdaBoost, and LR, have been used. These models were evaluated in
terms of accuracy, precision, specificity, recall, F1-score, NPV, FPR, rate of misclassification, and ROC
curve. The experimental findings indicated that all of the models performed well. GB, with 97.2% accuracy,
shows the best performance on the Early-Stage Diabetes Risk Prediction dataset. RF and AdaBoost performed
similarly to GB; however, their precision was not as good as GB's. With an accuracy of
88.2%, the KNN classifier was the worst performer in classification. SVM, ANN, SGD, and LR fell between
RF and KNN in the performance range. In terms of recall and F1-score, GB ranked first with a value of 0.972
for both. Because our dataset is an example of an unbalanced dataset, the F1-score offers a better
understanding of our models' performance, since it balances precision and recall. It can also be noted that
RF has the greatest AUC value (0.996). An AUC of this magnitude implies that RF is a trustworthy model.
Finally, we may conclude that, among all the classifiers tested in this work, GB and RF are the best for
predicting early-stage diabetes. Because this study evaluates the performance of the models individually, the
performance of combinations of these models is yet to be explored. Future work will focus on developing
prediction models using ensemble methods such as soft or hard voting classifiers, or on developing hybrid
classifiers, and then on enhancing those models for better performance.
Figure 1. Orange data mining toolkit's experimental setup